Synthetic caption
Improving multimodal datasets with image captioning
Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better on Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity. The synthetic captions used in our experiments are now available on HuggingFace.
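The caption-mixing idea lends itself to a very small data-pipeline sketch. The following is only an illustration, assuming precomputed CLIP similarity scores for both caption sources; the threshold value and field names are assumptions, not the paper's exact recipe:

```python
# A minimal sketch of one raw/synthetic caption mixing strategy, assuming
# precomputed CLIP similarity scores; the threshold and field names are
# illustrative, not the paper's exact recipe.
from dataclasses import dataclass
from typing import Optional

@dataclass
class WebSample:
    image_path: str
    raw_caption: str
    synthetic_caption: str       # e.g. from a captioning model (assumed)
    clip_score_raw: float        # CLIP image-text similarity of raw caption
    clip_score_synthetic: float  # same, for the generated caption

def select_caption(s: WebSample, threshold: float = 0.3) -> Optional[str]:
    """Keep the raw caption when it is well-aligned with the image;
    otherwise fall back to the synthetic caption if that one passes.
    Returning None drops the sample entirely."""
    if s.clip_score_raw >= threshold:
        return s.raw_caption
    if s.clip_score_synthetic >= threshold:
        return s.synthetic_caption
    return None

pool = [
    WebSample("img0.jpg", "IMG_4032.JPG", "a dog catching a frisbee", 0.12, 0.41),
    WebSample("img1.jpg", "red 1967 mustang at a car show", "a car", 0.38, 0.22),
]
training_pairs = [(s.image_path, c) for s in pool if (c := select_caption(s))]
print(training_pairs)  # img0 keeps the synthetic caption, img1 the raw one
```

The point of the fallback (rather than plain filtering) is that it preserves the image for training even when its alt-text is nondescript, which is exactly the diversity that hard filtering discards.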
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Fiastre, Gabriel, Yang, Antoine, Schmid, Cordelia
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities by leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking, and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
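The synthetic-annotation step behind LVISCap/LV-VISCap can be pictured as: highlight one object's trajectory and ask a VLM to describe it. The sketch below is an assumed reading of that step, not the paper's pipeline; `query_vlm` and `highlight` are hypothetical stand-ins:

```python
# A minimal sketch of VLM-based trajectory captioning. `query_vlm` is a
# hypothetical stand-in for a state-of-the-art VLM call; the highlighting
# and keyframe selection are assumptions, not the paper's exact method.
import numpy as np

def query_vlm(frames: list[np.ndarray], prompt: str) -> str:
    """Stand-in for a VLM call (API or local model)."""
    return "a brown dog running across the lawn"  # placeholder output

def highlight(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Dim everything outside the object's mask so the VLM attends to it."""
    out = frame.astype(np.float32)
    out[~mask] *= 0.3
    return out.astype(np.uint8)

def caption_trajectory(frames, masks, n_keyframes: int = 4) -> str:
    # Sample a few evenly spaced keyframes along the trajectory.
    idx = np.linspace(0, len(frames) - 1, n_keyframes).astype(int)
    keyframes = [highlight(frames[i], masks[i]) for i in idx]
    return query_vlm(
        keyframes,
        "Describe the highlighted object and what it does across these frames.",
    )

# Toy trajectory: 8 frames of 64x64 RGB with a fixed square mask.
frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(8)]
mask = np.zeros((64, 64), dtype=bool); mask[16:48, 16:48] = True
print(caption_trajectory(frames, [mask] * 8))
```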
MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
Dufour, Nicolas, Degeorge, Lucas, Ghosh, Arijit, Kalogeiton, Vicky, Picard, David
Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data, together with optimizing for a single reward, tends to harm diversity, semantic fidelity, and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and on user-preference scores (PickScore, ImageReward, HPSv2).
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
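Multi-reward conditioning of this kind can be sketched as an extra learned embedding added to the usual conditioning signal. The layer sizes and the way the embedding is fused below are assumptions, not MIRO's actual architecture:

```python
# A minimal sketch of conditioning a generator on multiple reward scores.
# Layer sizes and the additive fusion are assumptions, not MIRO's design.
import torch
import torch.nn as nn

class MultiRewardEmbedding(nn.Module):
    """Map a vector of per-sample reward scores (e.g. aesthetic score,
    preference score, CLIP alignment) to a conditioning embedding that is
    added to the usual text/timestep conditioning."""
    def __init__(self, n_rewards: int, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_rewards, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, rewards: torch.Tensor) -> torch.Tensor:
        return self.mlp(rewards)

dim, n_rewards = 256, 3
embed = MultiRewardEmbedding(n_rewards, dim)

# During training: score each image with the frozen reward models and feed
# the scores in; at sampling time, condition on high target rewards instead.
batch_rewards = torch.tensor([[0.9, 0.8, 0.95], [0.2, 0.4, 0.3]])
text_cond = torch.randn(2, dim)          # stand-in for text embeddings
cond = text_cond + embed(batch_rewards)  # fused conditioning signal
print(cond.shape)  # torch.Size([2, 256])
```

Because the reward scores are inputs rather than a filtering criterion, no training data is discarded; low-reward images still teach the model what "low reward" looks like.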
MobileCLIP2: Improving Multi-Modal Reinforced Training
Faghri, Fartash, Vasu, Pavan Kumar Anasosalu, Koc, Cem, Shankar, Vaishaal, Toshev, Alexander, Tuzel, Oncel, Pouransari, Hadi
Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency, lightweight architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe a 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with the MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2x smaller and improves on DFN ViT-L/14 at 2.5x lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
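The temperature-tuning finding can be illustrated with a minimal contrastive distillation loss: a KL divergence between teacher and student image-to-text similarity distributions, each softened by its own temperature. This is only a sketch of the general technique; the exact losses and teacher weighting in MobileCLIP2 may differ:

```python
# A minimal sketch of contrastive knowledge distillation with tunable
# temperatures. Temperatures and dimensions are illustrative.
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_img, student_txt, teacher_img, teacher_txt,
                        tau_teacher=0.7, tau_student=1.0):
    """KL between teacher and student image-to-text similarity distributions
    over the batch. Embeddings are L2-normalized; tau_* control how sharp
    each distribution is, which is the knob the ablations tune."""
    s = F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).T
    t = F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).T
    log_p_student = F.log_softmax(s / tau_student, dim=-1)
    p_teacher = F.softmax(t / tau_teacher, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Toy batch: student and teacher may live in different embedding dims.
B, d_student, d_teacher = 8, 256, 768
loss = contrastive_kd_loss(torch.randn(B, d_student), torch.randn(B, d_student),
                           torch.randn(B, d_teacher), torch.randn(B, d_teacher))
print(loss.item())
```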
Mining Contextualized Visual Associations from Images for Creativity Understanding
Sahu, Ananya, Ananthram, Amith, McKeown, Kathleen
Understanding another person's creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal, alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high-quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7M creative captions for the images in MS-COCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.
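The mining-to-caption step can be pictured as prompting a model at increasing levels of abstraction, grounded in the mined associations. The prompt wording and the `query_vlm` stub below are assumptions, not the paper's released generation code:

```python
# A minimal sketch of turning mined associations into captions at three
# abstraction levels. `query_vlm` is a hypothetical stand-in for a VLM call.
def query_vlm(prompt: str) -> str:
    return "a lone lighthouse standing against the storm"  # placeholder

def creative_captions(element: str, associations: list[str]) -> dict[int, str]:
    """Generate captions for one salient visual element at three levels:
    literal, associative, and abstract/metaphorical."""
    assoc = ", ".join(associations)
    prompts = {
        0: f"Describe the {element} literally.",
        1: f"Describe the {element}, weaving in: {assoc}.",
        2: f"Write an abstract, metaphorical caption evoking {element} "
           f"through: {assoc}.",
    }
    return {level: query_vlm(p) for level, p in prompts.items()}

print(creative_captions("lighthouse", ["solitude", "guidance", "endurance"]))
```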
How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions
Brack, Manuel, Katakol, Sudeep, Friedrich, Felix, Schramowski, Patrick, Ravi, Hareesh, Kersting, Kristian, Kale, Ajinkya
Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image captions are crucial to a model's performance. Given the noisiness and inconsistency of web-scraped datasets, recent work has shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, current literature does not provide any insights into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions introduce significant shifts in the output bias of a trained model. Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.
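The randomized-length strategy is simple to picture: per training sample, draw a target length and pick the closest of several pre-generated captions. A minimal sketch, assuming candidate captions were produced offline at several verbosity levels (the length range is illustrative):

```python
# A minimal sketch of randomized-length caption sampling. Assumes a captioner
# has produced candidates at several verbosity levels offline; the 5-60 word
# target range is an illustrative choice, not the paper's setting.
import random

def pick_caption(candidates: list[str], rng: random.Random) -> str:
    """candidates: captions of the same image at different verbosity levels.
    Sampling a fresh target word count varies caption density per epoch."""
    target = rng.randint(5, 60)
    return min(candidates, key=lambda c: abs(len(c.split()) - target))

rng = random.Random(0)
candidates = [
    "a cat on a sofa",
    "a ginger cat curled up on a grey sofa near a window",
    "a fluffy ginger cat sleeping on the armrest of a grey fabric sofa, "
    "soft afternoon light coming through a nearby window, cozy living room",
]
for _ in range(3):
    print(pick_caption(candidates, rng))
```

Varying the target per sample is what lets one image contribute both dense and sparse supervision across epochs, which matches the paper's finding that randomized lengths balance alignment against aesthetics and diversity.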